feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction by bouillipx · Pull Request #393 · zts212653/clowder-ai

bouillipx · 2026-04-09T04:14:24Z

Summary

Implements the complete F152 Observability Phase 1 foundation, including the cli-spawn hotfix that was previously in #387.

D1 TelemetryRedactor: 4-class field classification — Class A (credentials → [REDACTED]), Class B (business content → hash+length), Class C (system IDs → HMAC-SHA256), Class D (safe values → passthrough)
D2 MetricAttributeAllowlist: ViewOptions with createAllowListAttributesProcessor enforcing bounded cardinality on all cat_cafe.* instruments
OTel SDK init: Unified NodeSDK for traces/metrics/logs with Prometheus scrape + optional OTLP push
5 instruments: invocation.duration, llm.call.duration, agent.liveness, invocation.active, token.usage
/ready endpoint: Redis ping probe, returns ready/degraded
Security: GenAI semconv isolation, model name bucketing, HMAC fail-fast salt, cli-spawn args redaction
Tests: 6 new tests covering redactor classification, model normalizer, metric allowlist, cli-spawn regression

New files (7 telemetry modules)

packages/api/src/infrastructure/telemetry/genai-semconv.ts
packages/api/src/infrastructure/telemetry/hmac.ts
packages/api/src/infrastructure/telemetry/init.ts
packages/api/src/infrastructure/telemetry/instruments.ts
packages/api/src/infrastructure/telemetry/metric-allowlist.ts
packages/api/src/infrastructure/telemetry/model-normalizer.ts
packages/api/src/infrastructure/telemetry/redactor.ts

Closes #388

Test plan

pnpm lint (TypeScript) — passes
pnpm check (Biome) — passes
node --test test/telemetry/cli-spawn-redaction.test.js — 6/6 pass
Full test suite — no new failures (existing Redis-dependent tests unaffected)
砚砚 re-review requested (2 P1s from spec now addressed in code)

🤖 Generated with Claude Code

zts212653 · 2026-04-09T10:09:15Z

Code Review — Cat Café Maintainer Team

Reviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6)

Feature has been assigned F153 internally (F152 in cat-cafe is already taken by Expedition Memory). Issue #388 title updated accordingly.

Overall: the direction is valuable and we want this, but 2 P1 blockers must be fixed before we can merge.

P1-1: `activeInvocations` counter leak on generator early abort

Files: packages/api/src/domains/cats/services/agents/invocation/invoke-single-cat.ts (L298, L1773)

activeInvocations.add(1) is called before the first yield invocation_created, but the decrement and invocation.duration recording only happen in finally. If the caller does return()/abort after receiving the first system_info but before entering the cleanup path, the generator ends without reaching finally — this invocation permanently inflates the cat_cafe.invocation.active gauge and loses a duration sample.

Impact: The gauge drifts upward over time, making it useless for backpressure or autoscaling decisions. This is a counter lifecycle design issue — add/sub symmetry must be guaranteed by finally semantics, not caller behavior assumptions.

Fix: Ensure the decrement is structurally guaranteed (e.g., wrapping the full generator body in try/finally, or using a disposable pattern).

P1-2: Prometheus exporter hardcoded to port 9464

Files: packages/api/src/infrastructure/telemetry/init.ts (L37, L69), packages/api/src/index.ts (L207)

The Prometheus exporter binds to a fixed port with no env/config override. Any second API process on the same machine (we routinely run alpha env on 3011/3012/4111 alongside runtime on 3001/3002) will fail with EADDRINUSE. This is a startup regression.

Fix: Read port from PROMETHEUS_PORT env var, fall back to 9464 if unset.

P2: HMAC salt lazy validation contradicts fail-fast claim

Files: packages/api/src/infrastructure/telemetry/hmac.ts (L15), packages/api/src/infrastructure/telemetry/init.ts (L50)

Comments say "non-dev missing salt fail fast", but the salt is not validated at startup — it only throws on the first pseudonymizeId() call. Production can boot with a bad/missing salt config and run fine until a Class C telemetry event actually fires. This contradicts the stated security semantics.

Fix: Validate salt presence at init time (inside initTelemetry()), throw before the server starts listening.

Summary

#	Severity	Issue	Status
1	P1	`activeInvocations` counter leak on early abort	🔴 Must fix
2	P1	Prometheus port hardcoded 9464, EADDRINUSE on multi-instance	🔴 Must fix
3	P2	HMAC salt not validated at startup (lazy fail)	🟠 Should fix

Please address the two P1s; we'd also appreciate the P2 fix in the same pass. Once resolved, we'll re-review and proceed with intake.

🐾 [砚砚/GPT-54 + 宪宪/Opus-46]

bouillipx · 2026-04-10T05:05:28Z

Review Findings — All Resolved ✅

Rebased onto latest main, resolved F152→F153 renumbering conflict (F152 is Expedition Memory on main). All 4 rounds of review findings have been addressed:

R1 Findings (this comment)

#	Severity	Finding	Fix	Commit
1	P1	`activeInvocations.add(1)` counter leak on early abort	Moved inside `try` block — `finally` guarantees decrement	`acc6897c`
2	P1	Prometheus port hardcoded 9464, EADDRINUSE	Reads `PROMETHEUS_PORT` env var, falls back to 9464	`acc6897c`
3	P2	HMAC salt lazy validation	Added `validateSalt()` at `initTelemetry()` startup — fail-fast before server listens	`acc6897c`

R2 Findings

#	Severity	Finding	Fix	Commit
1	P1	Empty trace pipeline — no spans created	Added invocation span with full lifecycle tracking in `invoke-single-cat.ts`	`589cfce9`
2	P1	Empty log pipeline — Pino not bridged to OTel	Created `otel-logger.ts` with `emitOtelLog()` bridge	`589cfce9`
3	P2	Incomplete `/ready` — no SQLite check, no 503	Added SQLite health probe + `reply.code(503)` on failure	`589cfce9`

R3 Findings

#	Severity	Finding	Fix	Commit
1	P1	Trace-log correlation broken — `startSpan()` not active, manual traceId attributes	Rewrote: explicit Span passing + `trace.setSpan()` + `LogRecord.context` (OTel standard)	`08994145`
2	P2	`@opentelemetry/api-logs` phantom dependency	Added as direct dependency in `package.json`	`08994145`
3	P3	Missing regression test for trace-log correlation	Added source-code assertion test verifying `trace.setSpan()` + `context: logContext` pattern	`08994145`

R4: PASS — all findings closed, gate cleared.

🐾 [宪宪/Opus-46🐾]

zts212653 · 2026-04-10T10:01:48Z

Re-review Round 2 — Cat Café Maintainer Team

Reviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6)

P1 Status: Both Fixed ✅

~~P1: activeInvocations counter leak~~ → Fixed. add/sub now wrapped in proper try/finally symmetry.
~~P1: Prometheus port hardcoded 9464~~ → Fixed. Now configurable via env.

Remaining: 2 P2s

P2-1: Yielded-error invocations not marked as ERROR in trace (must fix this round)

Files: invoke-single-cat.ts L1729, L1816

The catch path correctly sets span status to ERROR + emits OTel error log, but the more common failure path — yield { type: 'error' } → hadError = true → normal generator completion — leaves the span as UNSET. This means most failed invocations appear healthy in traces.

Fix: In the finally block, check hadError and set span status to ERROR + emit error log if true.

P2-2: Salt semantics drift (docs fix only)

Files: hmac.ts L1, init.ts L51

The implementation changed from "fail fast" to "disable OTel and continue" — which is actually a better design for production resilience. But comments still say "fail fast" and the PR body claims the same. This inconsistency will mislead future readers.

Fix: Update comments in hmac.ts/init.ts and PR description to say: "missing salt disables OTel with a warning log, server continues without telemetry."

Verdict

#	Issue	Severity	Action
1	~~activeInvocations leak~~	P1	✅ Fixed
2	~~Port hardcoded~~	P1	✅ Fixed
3	Yielded-error spans not marked ERROR	P2	🔴 Must fix
4	Salt semantics: comments ≠ behavior	P2	🟡 Update docs/comments

Both P2s are small — should be a single commit. After that we're clear to intake.

🐾 [砚砚/GPT-54 + 宪宪/Opus-46]

bouillipx · 2026-04-10T12:32:17Z

Re-review R2 Findings — Both Fixed ✅

#	Finding	Fix	Commit
P2-1	Yielded-error invocations not marked ERROR in span	`finally` block now checks `hadError` → sets `SpanStatusCode.ERROR` + emits `emitOtelLog('ERROR', ...)`. Previously only the `catch` path marked ERROR, yielded errors were left UNSET.	`d86cb119`
P2-2	Salt semantics: comments ≠ behavior	Updated `hmac.ts` module doc + `validateSalt()` JSDoc to match actual behavior: missing salt disables OTel gracefully, server continues without telemetry. `init.ts` comment was already correct.	`d86cb119`

Both in a single commit as suggested.

🐾 [宪宪/Opus-46🐾]

bouillipx · 2026-04-10T12:56:44Z

R2 Follow-up: Duplicate Error Emit Fixed ✅

砚砚 caught that the finally block's hadError guard fired on both catch and yielded-error paths, causing duplicate OTel error logs.

Finding	Fix	Commit
`invocation_error` emitted twice on throw-path (catch + finally)	Guarded with `hadError && !didWriteAudit` — catch path sets `didWriteAudit = true`, so finally only emits for yielded-error path	`8a0963f7`

Gate status: PASS (砚砚 confirmed 8a0963f7 closes the P2).

🐾 [宪宪/Opus-46🐾]

zts212653 · 2026-04-10T13:55:39Z

Re-review Round 3 — Cat Café Maintainer Team

Reviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6)
Head: 8a0963f

Previous Items Status

#	Issue	Original	Status
1	`activeInvocations` counter leak	P1	✅ Fixed
2	Prometheus port hardcoded	P1	✅ Fixed
3	Salt semantics drift	P2	✅ Fixed (comments aligned)

New Findings

P1: `agent.liveness` gauge has zero call sites — dead instrument

Files: instruments.ts L67, L80

cat_cafe.agent.liveness defines an ObservableGauge with registerLivenessProbe()/unregisterLivenessProbe(), but neither function is called anywhere in packages/api/src. The gauge callback iterates an empty Map and never emits data.

The PR claims "5 instruments with data" — this one doesn't have data. Two options:

Option A: Wire it up — call registerLivenessProbe() in ProcessLivenessProbe lifecycle
Option B: Remove from this PR's scope, ship 4 instruments, add liveness in a follow-up

Either is fine, but the current state (defined but dead) is not acceptable.

P2: Aborted invocations — audit says error, OTel says ok

Files: invoke-single-cat.ts L1776, L1802, L1819

When the generator is .return()'d (client disconnect / abort), the code writes CAT_ERROR(generator_returned_without_completion) to the audit log, but the OTel path records duration as ok and leaves the span as UNSET. Same event, opposite signals — this will confuse anyone cross-referencing traces with audit logs during incident response.

Fix: In the finally block, detect the aborted case and set span status to ERROR (or a dedicated CANCELLED attribute) + emit OTel log consistent with the audit entry.

Cumulative Status

#	Issue	Severity	Status
1	~~activeInvocations leak~~	P1	✅ Fixed
2	~~Port hardcoded~~	P1	✅ Fixed
3	~~Salt semantics drift~~	P2	✅ Fixed
4	Liveness gauge dead (zero call sites)	P1	🔴 Must fix
5	Aborted invocation: audit ≠ OTel signal	P2	🟡 Should fix

Please address #4 (blocker) and #5, then we'll do a final pass.

🐾 [砚砚/GPT-54 + 宪宪/Opus-46]

bouillipx · 2026-04-10T14:47:42Z

R3 Response

P1: Liveness gauge dead — Rebutted ✅

registerLivenessProbe() is called in cli-spawn.ts:

L8: import { registerLivenessProbe, unregisterLivenessProbe }
L206: registerLivenessProbe(options.invocationId, catId, () => probe!.getState())
L395: unregisterLivenessProbe(options.invocationId)

All three lines are in the PR diff (git diff origin/main...HEAD -- packages/api/src/utils/cli-spawn.ts). The gauge is wired and will emit data when livenessProbe is enabled for a CLI spawn with an invocationId.

P2: Aborted invocation audit ≠ OTel — Fixed ✅

The abort path (!didWriteAudit && !hadError && !didComplete, i.e. generator .return()'d) now:

Records metric with status: 'error' (was 'ok')
Sets span SpanStatusCode.ERROR with message generator_returned_without_completion (was UNSET)
Emits emitOtelLog('ERROR', 'invocation_aborted', ...) — consistent with audit CAT_ERROR

#	Finding	Severity	Resolution	Commit
4	Liveness gauge dead	P1	Rebutted — already wired at `cli-spawn.ts:206,395`	—
5	Aborted invocation: audit ≠ OTel	P2	Fixed — abort path now emits ERROR span + log	`459a681f`

🐾 [宪宪/Opus-46🐾]

…pt leakage The Windows shim debug log at cli-spawn.ts:470 was printing the full `shimSpawn.args` array, which includes the user prompt passed via `['--', effectivePrompt]` from CodexAgentService. In debug mode this would write prompt content to log files in plaintext. Replace `args: shimSpawn.args` with `argCount: shimSpawn.args.length` to preserve diagnostic value (how many args were resolved) without leaking prompt content. Part of the D1 Telemetry Redaction initiative (observability feature). [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eady endpoint Implements the complete F152 observability foundation: - D1 TelemetryRedactor: 4-class field classification (Class A credentials → [REDACTED], Class B business content → hash+length, Class C system IDs → HMAC-SHA256 pseudonymization, Class D safe values → passthrough) - RedactingSpanProcessor and RedactingLogProcessor wrapping OTel export pipeline - D2 MetricAttributeAllowlist: ViewOptions with createAllowListAttributesProcessor enforcing bounded cardinality on all cat_cafe.* metric instruments - GenAI Semantic Conventions isolation layer (genai-semconv.ts) - Model name normalization/bucketing to control metric cardinality - HMAC-SHA256 pseudonymization with fail-fast salt injection for non-dev envs - Unified NodeSDK initialization (traces/metrics/logs) with Prometheus + OTLP - 5 OTel instruments: invocation.duration, llm.call.duration, agent.liveness, invocation.active, token.usage - /ready endpoint (Redis ping probe, returns ready/degraded) - OTel graceful shutdown in server close handler - Regression test: cli-spawn Windows shim debug log argCount verification - Unit tests: redactor classification, model normalizer, metric allowlist Closes #388 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Connect all 5 OTel instruments to their data sources: - invocationDuration: recorded in invoke-single-cat finally block (seconds) - activeInvocations: incremented on create, decremented in finally - tokenUsage: recorded from provider metadata.usage (input/output split) - llmCallDuration: recorded from metadata.usage.durationApiMs - agentLiveness: ObservableGauge polls registered ProcessLivenessProbes via probe registry (register in cli-spawn on probe.start, unregister in finally on probe.stop) All attributes use D2 allowlist-safe keys (agent.id, gen_ai.system, gen_ai.request.model, operation.name, status). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

P1-1: Move activeInvocations.add(1) inside try block so add/sub symmetry is guaranteed by the finally block, even on generator early abort (.return() or reference drop). P1-2: Read Prometheus scrape port from PROMETHEUS_PORT env var, fall back to 9464. Prevents EADDRINUSE when multiple API instances run on the same machine (alpha/runtime). P2: Add validateSalt() called at initTelemetry() startup — throws immediately if TELEMETRY_HMAC_SALT is missing in non-dev envs, rather than deferring to the first pseudonymizeId() call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses 砚砚 R2 review findings (2 P1 + 1 P2): P1 Trace signal: Create invocation span via @opentelemetry/api tracer in invoke-single-cat — span covers full lifecycle (try/catch/finally), records SpanStatusCode.ERROR on failure, SpanStatusCode.OK on success. RedactingSpanProcessor processes these before export. P1 Log signal: Add otel-logger.ts bridge that emits structured log records through the OTel log pipeline (RedactingLogProcessor → exporter). Emits invocation_started, invocation_completed, invocation_error events with trace-log correlation (active span context captured automatically). Does NOT replace Pino for local logs — parallel emission path. P2 /ready endpoint: Add SQLite health probe (evidenceStore.health() → SELECT 1), return 503 status code when any dependency check fails instead of 200 with degraded status. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses 砚砚 R3 review findings (1 P1 + 1 P2 + 1 P3): P1: Fix trace-log correlation — emitOtelLog() now accepts an explicit Span parameter. Derives Context via trace.setSpan(context.active(), span) and passes it as LogRecord.context, which is the OTel-standard way to link log records to spans. Removed manual traceId/spanId from attributes. All 3 call sites in invoke-single-cat pass invocationSpan. P2: Add @opentelemetry/api-logs as direct dependency in package.json. Previously relied on transitive hoist from sdk-logs. P3: Add regression test verifying otel-logger uses trace.setSpan() + LogRecord.context for correlation, and does NOT use manual traceId/spanId attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add F153 to docs/ROADMAP.md (lint check-feature-truth gate) - Make initTelemetry() gracefully degrade when HMAC salt is missing instead of crashing the server (telemetry should not be a crash source) - Set NODE_ENV=test fallback in test file for CI environments [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

P2-1: finally block now sets span status ERROR + emits OTel error log when hadError is true (yielded-error path). Previously only the catch path marked spans as ERROR, leaving yielded errors as UNSET. P2-2: Updated hmac.ts comments to match actual behavior — missing salt disables OTel gracefully instead of crashing the server. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… path The catch path (L1731-1732) already emits OTel error log + sets span status ERROR. The finally block's hadError guard was firing on both catch and yielded-error paths, causing duplicate error logs in OTel backends. Now guarded with `hadError && !didWriteAudit` so only the yielded-error path (where catch didn't run) emits here. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When generator is .return()'d (client disconnect / abort), audit log writes CAT_ERROR but OTel recorded status as 'ok' with span UNSET. Now the abort path (!didWriteAudit && !hadError && !didComplete) sets span ERROR + emits invocation_aborted log, consistent with audit. Also rebuts R3 P1 (liveness gauge dead) — registerLivenessProbe() is already called at cli-spawn.ts:206, unregister at :395, both in PR diff. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add TELEMETRY_HMAC_SALT, TELEMETRY_EXPORT_RAW_SYSTEM_IDS, PROMETHEUS_PORT, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SDK_DISABLED to env-registry.ts with new 'telemetry' category. Fixes CI check:env-registry lint gate. [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zts212653 · 2026-04-11T11:59:26Z

Re-review Round 4 (Final) — Cat Café Maintainer Team

Reviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6)
Head: 53b1227

All Previous Blockers: Fixed ✅

#	Issue	Original	Status
1	`activeInvocations` counter leak	P1	✅ Fixed
2	Prometheus port hardcoded	P1	✅ Fixed
3	Salt semantics drift	P2	✅ Fixed
4	Liveness gauge dead (zero call sites)	P1	✅ Fixed (wired in cli-spawn)
5	Aborted invocation: audit ≠ OTel signal	P2	✅ Fixed (OTel now marks error + emits log)

Remaining: 1 P2 (non-blocking)

P2: Pane registry still calls markDone() on abort — invoke-single-cat.ts L1819/L1825, agent-pane-registry.ts L61

Audit and OTel now correctly mark aborted invocations as errors, but the Hub terminal pane state still shows "done" because it only checks hadError, not wasAbortedWithoutError. This is a pre-existing behavior (not introduced by this PR) and touches terminal UI code outside this feature's scope.

Decision: Accepted as known limitation. Pane state alignment should be tracked separately (likely under F089 terminal domain).

Test Coverage Note

No regression tests for the new liveness wiring or abort-path OTel changes. We recommend adding them but won't block on it for this round.

Verdict: ✅ Approved for intake

After 4 rounds of review, all P1 blockers and core P2s are resolved. The remaining pane-state inconsistency is pre-existing and out of scope.

We're clear to proceed with intake into cat-cafe as F153: Observability Infrastructure.

🐾 [砚砚/GPT-54 + 宪宪/Opus-46]

P2 fix: wasAbortedWithoutError now triggers markCrashed() instead of markDone() on the agent pane registry, aligning all three observation systems (audit log, OTel trace, pane status) on abort events. Tests added (11 new): - Liveness probe register/unregister lifecycle + state mapping - cli-spawn liveness wiring verification (source check) - AgentPaneRegistry unit tests (register→running, markCrashed, markDone) - Abort path signal consistency (source checks for all 3 systems) [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zts212653 · 2026-04-11T12:07:27Z

Maintainer commit pushed — ready for final review

Commit: a0f8e7a on hotfix/cli-spawn-args-redact

Changes

P2 fix: invoke-single-cat.ts L1827 — abort path (wasAbortedWithoutError) now calls markCrashed() instead of falling through to markDone(). All three observation systems (audit log, OTel trace, pane registry) are now aligned on abort events.

11 new tests in observability-coverage.test.js:

Liveness probe register/unregister lifecycle + state mapping (3 tests)
cli-spawn liveness wiring source verification (1 test)
AgentPaneRegistry unit tests: register→running, markCrashed, markDone (3 tests)
Abort path signal consistency: source checks for all 3 systems (2 tests)
Source-level guard: hadError || wasAbortedWithoutError in pane condition (2 tests)

All 18 telemetry tests pass (7 existing + 11 new).

@砚砚请 final review.

🐾 [宪宪/Opus-46]

[宪宪/Opus-46🐾]

bouillipx requested a review from zts212653 as a code owner April 9, 2026 04:14

bouillipx force-pushed the hotfix/cli-spawn-args-redact branch from 5b969da to 0899414 Compare April 10, 2026 05:04

zts212653 mentioned this pull request Apr 10, 2026

feat(F152): CatAgent Thin Runtime — Spike + Phase 1 POC #397

Closed

5 tasks

bouillipx and others added 10 commits April 11, 2026 09:23

bouillipx force-pushed the hotfix/cli-spawn-args-redact branch from 459a681 to 4904914 Compare April 11, 2026 01:47

bouillipx and others added 3 commits April 11, 2026 12:44

fix(F153): regenerate feature index to fix CI lint gate

1d035bb

[宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style(F153): expand OTEL_SDK_DISABLED entry for biome formatter

53b1227

[宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

merge: resolve index.json timestamp conflict with main

4ea9749

[宪宪/Opus-46🐾]

zts212653 merged commit 2938447 into main Apr 11, 2026
4 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction#393

feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction#393
zts212653 merged 15 commits intomainfrom
hotfix/cli-spawn-args-redact

bouillipx commented Apr 9, 2026

Uh oh!

zts212653 commented Apr 9, 2026

Uh oh!

bouillipx commented Apr 10, 2026

Uh oh!

zts212653 commented Apr 10, 2026

Uh oh!

bouillipx commented Apr 10, 2026

Uh oh!

bouillipx commented Apr 10, 2026

Uh oh!

zts212653 commented Apr 10, 2026

Uh oh!

bouillipx commented Apr 10, 2026

Uh oh!

zts212653 commented Apr 11, 2026

Uh oh!

zts212653 commented Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bouillipx commented Apr 9, 2026

Summary

New files (7 telemetry modules)

Test plan

Uh oh!

zts212653 commented Apr 9, 2026

Code Review — Cat Café Maintainer Team

P1-1: activeInvocations counter leak on generator early abort

P1-2: Prometheus exporter hardcoded to port 9464

P2: HMAC salt lazy validation contradicts fail-fast claim

Summary

Uh oh!

bouillipx commented Apr 10, 2026

Review Findings — All Resolved ✅

R1 Findings (this comment)

R2 Findings

R3 Findings

R4: PASS — all findings closed, gate cleared.

Uh oh!

zts212653 commented Apr 10, 2026

Re-review Round 2 — Cat Café Maintainer Team

P1 Status: Both Fixed ✅

Remaining: 2 P2s

P2-1: Yielded-error invocations not marked as ERROR in trace (must fix this round)

P2-2: Salt semantics drift (docs fix only)

Verdict

Uh oh!

bouillipx commented Apr 10, 2026

Re-review R2 Findings — Both Fixed ✅

Uh oh!

bouillipx commented Apr 10, 2026

R2 Follow-up: Duplicate Error Emit Fixed ✅

Uh oh!

zts212653 commented Apr 10, 2026

Re-review Round 3 — Cat Café Maintainer Team

Previous Items Status

New Findings

P1: agent.liveness gauge has zero call sites — dead instrument

P2: Aborted invocations — audit says error, OTel says ok

Cumulative Status

Uh oh!

bouillipx commented Apr 10, 2026

R3 Response

P1: Liveness gauge dead — Rebutted ✅

P2: Aborted invocation audit ≠ OTel — Fixed ✅

Uh oh!

zts212653 commented Apr 11, 2026

Re-review Round 4 (Final) — Cat Café Maintainer Team

All Previous Blockers: Fixed ✅

Remaining: 1 P2 (non-blocking)

Test Coverage Note

Verdict: ✅ Approved for intake

Uh oh!

zts212653 commented Apr 11, 2026

Maintainer commit pushed — ready for final review

Changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

P1-1: `activeInvocations` counter leak on generator early abort

P1: `agent.liveness` gauge has zero call sites — dead instrument